Spatial statistics and ESDA

GEOG 40323

February 27, 2018

Statistics

  • Definition: the study of the collection, organization, analysis, interpretation, and presentation of data.
  • An understanding of statistics is therefore fundamental for the GIS practitioner.

Measures of central tendency

  • Mean
  • Median
  • Mode

Mean

The mean of a sample (\(\overline{x}\)) is calculated as follows:

\[\overline{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\]

where \(n\) is the number of elements in the sample.

Median

  • The value that separates the lower half of a sample from the upper half of a sample
  • If the sample has an odd number of cases, the median is the exact middle value; if the sample has an even number of cases, the median is the mean of the two middle values

Mode

  • Most frequent value in a sample
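
As a minimal sketch, the three measures can be computed with Python's standard statistics module. The sample values are made up for illustration.

```python
import statistics

# Hypothetical sample with an even number of cases
sample = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(sample)      # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median_value = statistics.median(sample)  # mean of the two middle values, 3 and 5
mode_value = statistics.mode(sample)      # 3 appears most often

print(mean_value, median_value, mode_value)
```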

Variance

  • A measure of the spread of a sample. The variance is computed as:

\[{\sigma}^2 = \dfrac{\sum\limits_{i=1}^{n}(x_i - \overline{x})^2}{n}\]

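or, in simpler terms, the average of the squared deviations of the values of a sample from its mean. (Many software packages divide by \(n - 1\) rather than \(n\) when estimating the variance from a sample.)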
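
A short sketch of the calculation, assuming a made-up sample. The function pvariance in Python's statistics module divides by \(n\), as in the formula above, while variance divides by \(n - 1\).

```python
import statistics

# Hypothetical sample; its mean is 5
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# Squared deviations from the mean: 9, 1, 1, 1, 0, 0, 4, 16 (sum = 32)
population_var = statistics.pvariance(sample)  # 32 / 8, equals 4
population_sd = statistics.pstdev(sample)      # square root of 4, equals 2
sample_var = statistics.variance(sample)       # 32 / 7, the n - 1 version

print(population_var, population_sd, sample_var)
```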

Standard deviation

  • Computed as the square root of the variance; denoted by \(\sigma\).
  • Offers a standardized way to discuss the spread of a distribution. For example, in a normally distributed sample (more to come on this):
    • About 68 percent of the values will be within one standard deviation of the mean
    • About 95 percent of the values will be within two standard deviations of the mean
    • About 99.7 percent of the values will be within three standard deviations of the mean
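
One way to check these shares is by simulation. The sketch below assumes NumPy is available and draws 100,000 values from a normal distribution; the seed and sample size are arbitrary choices.

```python
import numpy as np

# Simulate a large, normally distributed sample (seed fixed for repeatability)
rng = np.random.default_rng(seed=0)
draws = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Express each value as a number of standard deviations from the mean
z = (draws - draws.mean()) / draws.std()

# Share of values within 1, 2, and 3 standard deviations of the mean
share_within = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
print(share_within)
```

The printed shares should land close to 0.68, 0.95, and 0.997.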

Z-score

  • Often, variables in your analysis will have vastly different measurement scales. To make them comparable, variables can be standardized by computing z-scores.
  • A z-score is computed as follows:

\[Z = \dfrac{x - \overline{x}}{\sigma}\]

  • A z-score therefore reflects how many standard deviations an observation lies from the mean.
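
A minimal sketch of the standardization with NumPy, using made-up values. NumPy's std divides by \(n\) by default, which matches the variance formula given earlier.

```python
import numpy as np

# Hypothetical variable with mean 5 and standard deviation 2
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, then divide by the standard deviation
z_scores = (values - values.mean()) / values.std()

print(z_scores)  # the last value, 9, is two standard deviations above the mean
```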

Probability

  • Probability: the likelihood of the occurrence of an event
  • Statistical significance: an observation/effect is statistically significant when it would be unlikely to arise from random chance alone

The normal distribution

Implications of the distribution of your data

Source: http://www.southalabama.edu/coe/bset/johnson/lectures/lec15_files/image014.jpg

Correlation

  • Generally speaking, correlation refers to the extent to which two variables covary.
  • The most popular measure of correlation is Pearson’s product-moment correlation coefficient (or Pearson’s \(r\)), which is appropriate for two continuous variables.
  • \(r\) ranges from -1 to 1, where -1 reflects a perfect inverse (negative) linear relationship between two variables, and 1 reflects a perfect positive linear relationship. If \(r\) = 0, no linear relationship is apparent.
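
The sketch below computes Pearson's \(r\) directly from deviations about the mean. The function name pearson_r and the data are illustrative only. The third case shows that \(r\) = 0 rules out a linear relationship, not every relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariation of x and y scaled by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy))

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_none = pearson_r(x, [1, 2, 3, 2, 1])       # rises then falls: no linear trend

print(r_positive, r_negative, r_none)
```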

Regression

  • Regression analysis is used to study the relationship between a response variable \(Y\) and a series of explanatory variables \(X_1, \ldots, X_n\).
  • The general formula of a linear regression function is:

\[Y = a + b_{1}X_{1} + \cdots + b_{n}X_{n} + e\]

  • Our job is to estimate the unknown intercept \(a\) and the parameters \(b_1, \ldots, b_n\). \(e\) refers to the error, or residuals, in our equation.
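
As a sketch of the estimation step, ordinary least squares can be run with NumPy's lstsq. The data here are hypothetical and generated from known values (\(a\) = 1, \(b_1\) = 2, \(b_2\) = -3) with no error term, so the estimates should recover them.

```python
import numpy as np

# Hypothetical data generated from Y = 1 + 2*X1 - 3*X2
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per X
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of the intercept and slopes
coefs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs

# Residuals: observed Y minus fitted Y
residuals = y - X @ coefs
print(a, b1, b2)
```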

Spatial statistics

  • Statistical methods make a variety of assumptions about your data; for example, regression assumes that your residuals (errors) are independent and identically distributed (i.i.d.).
  • However, we remember Tobler’s First Law of Geography: "Everything is related to everything else, but near things are more related than distant things."
  • Given this property of spatial data, we need special methods to account for spatial dependence.

Exploratory spatial data analysis (ESDA)

  • Data analysis techniques used to investigate spatial patterns in a dataset
  • Used to discover unseen features of datasets; often involves visualization

Mean and median center

  • Used to identify the central location of a distribution of features
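
A minimal sketch of the mean center, assuming NumPy and four made-up point locations. The weighted version, with hypothetical population counts, is an illustrative extension in the spirit of a mean center of population.

```python
import numpy as np

# Hypothetical feature coordinates (x, y) at the corners of a square
coords = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])

# Hypothetical weights, e.g. population at each location
population = np.array([1.0, 1.0, 1.0, 5.0])

# Mean center: average x and average y
mean_center = coords.mean(axis=0)

# Weighted mean center: pulled toward the heavily weighted corner (0, 2)
weighted_mean_center = np.average(coords, axis=0, weights=population)

print(mean_center, weighted_mean_center)
```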

Source: US Census Bureau

Standard deviation ellipse

  • Reveals the directional distribution of points; analogous to a standard deviation

Source: Esri

Nearest neighbor

  • Compares the average distance from each feature to its nearest neighbor with the average distance expected under conditions of spatial randomness
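
The comparison can be sketched as a ratio of observed to expected mean nearest-neighbor distance. This sketch assumes the common expected value of \(0.5 / \sqrt{n / A}\) for \(n\) points in a study area of size \(A\). The function name and the data are illustrative, and the study area must be supplied by the analyst.

```python
import numpy as np

def nearest_neighbor_ratio(points, area):
    """Observed mean nearest-neighbor distance divided by the expected mean."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all points
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    observed = dist.min(axis=1).mean()
    expected = 0.5 / np.sqrt(n / area)
    return observed / expected

# Hypothetical evenly spaced points in a 5 x 5 study area
grid = [(x + 0.5, y + 0.5) for x in range(5) for y in range(5)]
ratio = nearest_neighbor_ratio(grid, area=25.0)
print(ratio)  # well above 1, pointing to dispersion
```

Ratios below 1 point toward clustering, and ratios above 1 toward dispersion.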

Source: Keith Clarke, UCSB

Ripley’s K function

  • Checks for clustering/dispersion at multiple scales

Source: Esri

Spatial autocorrelation

  • Autocorrelation: the correlation of a variable with itself
  • Temporal autocorrelation: similar values in similar time periods
  • Spatial autocorrelation: when features tend to have similar values to their neighbors (Tobler’s first law of geography!)

Spatial weights matrix

  • A spatial weights matrix is used to conceptualize spatial relationships between features
  • Types: distance-decay, contiguity, nearest-neighbor
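
As a sketch of the contiguity type, the code below builds a weights matrix for a hypothetical 3 x 3 grid of cells, where cells sharing an edge count as neighbors (often called rook contiguity). The function name is illustrative.

```python
import numpy as np

def rook_weights(n_rows, n_cols):
    """Binary contiguity matrix for a grid; cells sharing an edge are neighbors."""
    n = n_rows * n_cols
    w = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    w[i, rr * n_cols + cc] = 1.0
    return w

w = rook_weights(3, 3)

# Row-standardize so that each feature's weights sum to 1
w_std = w / w.sum(axis=1, keepdims=True)

print(w[4])      # the center cell has four neighbors
print(w_std[0])  # a corner cell splits its weight between two neighbors
```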

Source: biosolutions.us

Global measures of autocorrelation

  • Statistics designed to detect the presence of spatial autocorrelation in a dataset
  • Commonly used: Moran’s I, Geary’s C

Moran’s I

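  • Most commonly used global autocorrelation statistic
  • Values range from roughly -1 to 1; negative values indicate dispersion (dissimilar values next to one another), positive values indicate clustering of similar values, and values near 0 indicate a spatially random pattern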
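
A minimal sketch of the calculation, assuming the standard form of Moran's I, \(I = \dfrac{n}{S_0} \dfrac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}\), where \(z_i\) is the deviation of \(x_i\) from the mean and \(S_0\) is the sum of all weights. The grid, the values, and the function names are hypothetical.

```python
import numpy as np

def grid_rook_weights(n_rows, n_cols):
    """Binary edge-sharing contiguity weights for a grid, in row-major order."""
    def path(m):
        # Adjacency of m cells arranged in a line
        return np.eye(m, k=1) + np.eye(m, k=-1)
    return (np.kron(np.eye(n_rows), path(n_cols))
            + np.kron(path(n_rows), np.eye(n_cols)))

def morans_i(values, w):
    """Global Moran's I for values observed on the features described by w."""
    x = np.asarray(values, dtype=float).ravel()
    z = x - x.mean()
    return (len(x) / w.sum()) * (z @ w @ z) / (z @ z)

w = grid_rook_weights(4, 4)

# Alternating values: every neighbor differs (dispersion)
checkerboard = np.indices((4, 4)).sum(axis=0) % 2
# Left half low, right half high (clustering)
halves = np.repeat([[0, 0, 1, 1]], 4, axis=0)

i_dispersed = morans_i(checkerboard, w)
i_clustered = morans_i(halves, w)
print(i_dispersed, i_clustered)
```

The checkerboard yields a strongly negative value and the split grid a positive one.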

Local measures of spatial autocorrelation

  • Local statistic: statistic in a spatial dataset that varies from place to place
  • Whereas global measures of spatial autocorrelation can describe general characteristics of a dataset, local measures can illustrate where clustering is found in your data

Getis-Ord \(G_i\)

  • For a location \(i\), the local statistic \(G_i\) is computed as:

\[G_{i}(d) = \dfrac{\sum\limits_{j=1}^{n}w_{ij}(d)x_j}{\sum\limits_{j=1}^{n}x_j}, \quad j \neq i\]

  • In simpler terms, \(G_i\) reflects the relationship between the values of a neighborhood and the overall values of a study area. High values reflect a clustering of high values; low values reflect a clustering of low values.
  • The \(G_i^{*}\) statistic is similar, but includes the value for location \(i\) in the neighborhood calculation.
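
The raw \(G_i\) ratio can be sketched as follows, assuming binary weights equal to 1 for features within distance \(d\) of location \(i\) and 0 otherwise. The function name and data are made up. Software typically reports a standardized (z-score) version of the statistic, which this sketch does not compute.

```python
import numpy as np

def getis_ord_g(points, values, d):
    """Raw G_i(d): neighborhood sum of x over the study-area sum, excluding i."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(values, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    w = (dist <= d).astype(float)  # binary distance-band weights
    np.fill_diagonal(w, 0.0)       # G_i leaves out location i itself
    return (w @ x) / (x.sum() - x)

# Hypothetical points along a line: high values on the left, low on the right
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
values = [10, 10, 10, 1, 1, 1]

g = getis_ord_g(points, values, d=1.0)
print(g[1], g[4])  # larger inside the high-value cluster than the low-value one
```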

Local indicators of spatial association (LISA)

  • Local form of Moran’s I; used to identify clusters and outliers in a dataset
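
A minimal sketch of one common form of the local Moran statistic, \(I_i = \dfrac{z_i}{m_2} \sum_j w_{ij} z_j\) with \(m_2 = \sum_i z_i^2 / n\), using row-standardized weights. The six values along a line are hypothetical. Significance testing, which is needed to label clusters and outliers in practice, is not shown.

```python
import numpy as np

# Hypothetical values for six features arranged along a line
values = np.array([10.0, 10.0, 10.0, 1.0, 1.0, 1.0])
n = len(values)

# Neighbors are the adjacent features; rows are standardized to sum to 1
w = np.eye(n, k=1) + np.eye(n, k=-1)
w = w / w.sum(axis=1, keepdims=True)

z = values - values.mean()  # deviations from the mean
m2 = (z @ z) / n            # average squared deviation

# Local Moran's I: each feature's deviation times its neighbors' average deviation
local_i = z * (w @ z) / m2

# With row-standardized weights, the local values average to the global Moran's I
global_i = local_i.mean()
print(local_i, global_i)
```

Features inside either cluster get positive values, while the two features at the boundary between high and low values get 0.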

Spatial regression

  • Spatial dependence in your data violates some basic assumptions of the regression model
  • Alternative: spatial regression
  • Spatial lag model: The average neighborhood value of the response variable is modeled as an explanatory variable
  • Spatial error model: Used when residuals (errors) exhibit spatial autocorrelation; accounts for spatial autocorrelation in the error term
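
In the notation of the regression formula above, the two models are commonly written as shown below. The symbols \(W\) (a row-standardized spatial weights matrix), \(\rho\) and \(\lambda\) (spatial autoregressive coefficients), and \(u\) (an independent error term) are introduced here for illustration.

Spatial lag model:

\[Y = \rho WY + a + b_{1}X_{1} + \cdots + b_{n}X_{n} + e\]

Spatial error model:

\[Y = a + b_{1}X_{1} + \cdots + b_{n}X_{n} + e, \quad e = \lambda We + u\]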

Spatial statistics in ArcGIS

Other options for doing spatial analysis